Silas Kwok - Python Data Science Project Overview

(note: Full code can be found at the bottom of the document)

The problem I'm solving is that it is hard to know where countries stand relative to one another in terms of their population and GDP. I also want to see how groups of countries compare when they are grouped by government type.

Information sourced from "country_per_cap_gdp_unemployment_gov_type_pop.csv" on https://github.com/ghenshaw/datasets/blob/master/country_per_cap_gdp_unemployment_gov_type_pop.csv

Sample information:

Screen Shot 2021-11-30 at 9.14.59 PM.png

This information is interesting to me because it shows how countries compare on standard economic metrics such as unemployment rate, GDP per capita, and population, and because it elucidates the aggregate living conditions of a country's people.

Application of Systematic Program Design & Design Choices

I applied the skills I learned in creating proper data definitions and functions to solve this problem, including a Compound data type (for the Country data type) and an Arbitrary-Sized data type (for the List of Country data type). I also learned how to keep my project organized and legible by following the one-task-per-function rule, and I applied what I learned about how to design analysis programs and visualizations to create the scatterplot I wanted, with the correct legend, labels, and data.

A design choice I made was to separate the data points on my graph by GovType, which required me to apply the one-task-per-function rule by separating the validation (the is_govtype function) from the filtering (filter_govt_populations and filter_govt_gdps). I made this choice because I think the plot is more interesting to look at than it would be if all the data points were the same colour.

Screen Shot 2021-11-30 at 9.43.20 PM.png

Screen Shot 2021-11-30 at 9.43.35 PM.png

Data Visualization

Screen Shot 2021-11-30 at 9.12.38 PM.png

Problem Solving

The part of the project that was the most difficult for me was figuring out how to colour the data points by government type. This was because I had tried to create "get (insert govtype here) populations" and "get (insert govtype here) gdps" functions for each govtype, but that led me to create a variety of helper functions like "is_govtype_republic".

I found this option too clunky, so I made the code much less repetitive by using a single "is_govtype" function, which takes a Country (a compound that has GovType as one of its fields) as well as a GovType. I could then use this function in "filter_govt_gdps" and "filter_govt_populations". Thus, if I picked GovType.co (Communist State), the functions would look through the List of Country and add either the GDP (calculated by multiplying population and gdp_per_capita) or the population to a list of those values for only the Country values with GovType.co. These lists would later be used for either the x axis (population) or the y axis (GDP). I applied these functions to each GovType and plotted the results to create the graph I wanted.
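The validation and filtering split described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names and GovType.co come from the write-up, but the field names, the signatures, the GovType.re member, and the sample figures are assumptions made for the example.

```python
from enum import Enum
from typing import List, NamedTuple


class GovType(Enum):
    co = "Communist State"  # named in the write-up
    re = "Republic"         # hypothetical member name, for illustration


class Country(NamedTuple):
    # Field names are assumptions based on the description in the text.
    name: str
    gdp_per_capita: float
    gov_type: GovType
    population: int


def is_govtype(c: Country, gt: GovType) -> bool:
    """Return True if country c has government type gt (validation only)."""
    return c.gov_type == gt


def filter_govt_populations(loc: List[Country], gt: GovType) -> List[int]:
    """Return the populations of the countries in loc with government type gt."""
    return [c.population for c in loc if is_govtype(c, gt)]


def filter_govt_gdps(loc: List[Country], gt: GovType) -> List[float]:
    """Return the GDPs (gdp_per_capita * population) of the countries in loc
    with government type gt."""
    return [c.gdp_per_capita * c.population for c in loc if is_govtype(c, gt)]


# Made-up figures, not taken from the data set.
SAMPLE = [
    Country("Country A", 10.0, GovType.co, 100),
    Country("Country B", 20.0, GovType.re, 50),
    Country("Country C", 5.0, GovType.co, 10),
]

co_populations = filter_govt_populations(SAMPLE, GovType.co)  # x values
co_gdps = filter_govt_gdps(SAMPLE, GovType.co)                # y values
```

Because the government type is a parameter, the same two filter functions can be called once per GovType to build one plotted series per colour, instead of writing a separate helper for each government type.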

Limitations

A great (albeit challenging!) and likely rewarding task would be to improve the graph by scaling the data (perhaps with a log scale) so that data points in the bottom left cluster are more easily visible. Another highly valuable addition would be to add country names beside each of these points so that a lot more can be taken away from just looking at the graph.
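As a rough sketch of the scaling idea, the values could be transformed with a base-10 logarithm before plotting. The helper name log10_scale and the sample figures below are hypothetical. If the plot is drawn with matplotlib, calling Axes.set_xscale("log") and Axes.set_yscale("log") would scale the axes directly instead, and Axes.annotate could place a country name beside each point.

```python
import math
from typing import List


def log10_scale(values: List[float]) -> List[float]:
    """Return the base-10 logarithm of each value.

    Every value must be positive, which holds for populations and GDPs.
    """
    return [math.log10(v) for v in values]


# Made-up populations spanning six orders of magnitude.
populations = [1_000, 1_000_000, 1_000_000_000]

# After scaling, the three values are evenly spaced instead of the two
# smaller ones being squeezed into the bottom-left corner of the plot.
scaled = log10_scale(populations)
```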

Potential future projects/applications of systematic programming design skills

Some examples of ways I could use systematic programming design skills to solve problems in my chosen topic area (economics) in the future include:

  1. Looking through and analyzing economics CSV files (e.g. Gini coefficient, inequality, financial indicators) and finding the highest or lowest value of any variable.
  2. Finding the sum of a certain type of data (e.g. only data from countries with an unemployment rate <= 15%).
  3. Manipulating economic data to discover new insights about countries or organizations for which I have data.
  4. Plotting economic data in a way that is easily comprehensible and in which trends or outliers can be observed.
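The first two ideas in the list above could look something like the sketch below. The Record type, its field names, the function names, and the figures are all hypothetical and only illustrate the "find the highest" and "sum with a condition" patterns.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    # Hypothetical compound for one row of an economics data set.
    name: str
    unemployment_rate: float  # in percent
    population: int


def highest_unemployment(records: List[Record]) -> Record:
    """Return the record with the highest unemployment rate.

    Assumes records is non-empty.
    """
    return max(records, key=lambda r: r.unemployment_rate)


def total_population_at_most(records: List[Record], max_rate: float) -> int:
    """Return the total population of records whose unemployment rate
    is at most max_rate percent."""
    return sum(r.population for r in records if r.unemployment_rate <= max_rate)


# Made-up figures, not taken from any data set.
RECORDS = [
    Record("Country A", 4.5, 1_000),
    Record("Country B", 22.0, 2_000),
    Record("Country C", 15.0, 3_000),
]

worst = highest_unemployment(RECORDS)
total = total_population_at_most(RECORDS, 15.0)
```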

Planning + Full Code

Step 1a: Planning

The available information in 'country_per_cap_gdp_unemployment_gov_type_pop.csv' includes each country's name, GDP per capita, unemployment rate, government type, and population.

The information in the file that my program will read is each country's name, GDP per capita, government type, and population.

The unusual features in the data are that Libya and Thailand have "n/a" as their government type, and that there are 4 absolute monarchies (UAE, Brunei, Swaziland, and Saudi Arabia). Also, the values in the population column are separated by commas every three digits, while the values in GDP per capita are not.
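One way to handle these two quirks when reading the file is sketched below. The helper names parse_population and has_gov_type are hypothetical, and the exact spelling of the missing-value marker is assumed to be "n/a" as described above.

```python
def parse_population(text: str) -> int:
    """Convert a population string with thousands separators to an int.

    For example, "1,234,567" becomes 1234567.
    """
    return int(text.replace(",", ""))


def has_gov_type(text: str) -> bool:
    """Return False if the government type cell is the "n/a" marker."""
    return text.strip().lower() != "n/a"


population = parse_population("1,234,567")
```

Keeping the missing-value check in its own function means the decision about what to do with such rows (skip them, or plot them as their own group) can be made separately from the parsing.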

Step 1b: Planning

Brainstorm program ideas + Select Program

The initial ideas in my proposal were (1) a bar graph (with country on the x axis grouped by government type and GDP on the y axis), (2) a scatterplot (with unemployment rates on the x axis and GDP per capita on the y axis), and (3) a scatterplot (with unemployment rates on the x axis and GDP per capita on the y-axis, with countries coloured by continent).

Some ideas building on this initial brainstorming (THAT INCLUDE SUBSTANTIAL COMPUTATION, since GDP = gdp_per_capita * population) are:

I could do (1) a bar graph with GDP on the y-axis and country on the x-axis, grouped by government type (with GDP in descending order).

I could also do (2) a scatterplot of GDP vs. population, with each country coloured by government type.

I could do (3) a pie chart with each continent's share of GDP (or country's share of GDP, if information is still clear in that case).

I could also do (4) a bar graph where I group the data by continent and take the average of each continent's GDP so that we can compare continents and countries by their GDP.

There isn't a column with data on continents, so (3) and (4) would require information from outside the data set. Option (1) may not be the best choice for delivering insight into the data, since there are only a few government types, some of them have very few data points (e.g. absolute monarchies), and some countries have missing data (N/A) for their government type.

Thus, I choose to do (2), because it will be interesting to see how different countries and different government types compare in terms of their GDP and population.

To pursue option (2), I would need to:

  1. Make a new column called GDP, which would be calculated by multiplying "population" by "gdp per capita". This would be the y-axis of my scatterplot, while population would be my x-axis.
  2. Plot the GDP against the population, and assign each point a different colour (or shape) based on its government type, as well as label each point with its country's name (if space permits).

Step 1c: Planning

An example of what my program would produce

IMG_3567.JPG

Step 2: Designing data definitions

The information in the file I choose to represent is a country's name, gdp_per_capita, gov_type, and population. This information is important because GDP (as the y-axis) will be determined by finding the product of gdp_per_capita and population, population will be the x-axis, and gov_type will be used to colour the data points while country (name) will be used to label each point on the scatterplot if space permits.
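A possible shape for these data definitions is sketched below, together with a small reader that turns CSV rows into Country values. It is an illustration under assumptions: GovType.co is the only member name given in the write-up, while the other member names, the column headers, the government type strings, and the sample rows are made up for the example.

```python
import csv
import io
from enum import Enum
from typing import List, NamedTuple


class GovType(Enum):
    co = "Communist State"    # named in the write-up
    am = "Absolute Monarchy"  # hypothetical member name
    na = "n/a"                # hypothetical member for missing data


class Country(NamedTuple):
    name: str              # used to label a point
    gdp_per_capita: float  # multiplied by population to get GDP
    gov_type: GovType      # used to colour a point
    population: int        # x-axis value


def read_countries(csv_text: str) -> List[Country]:
    """Parse CSV text into a list of Country.

    The column headers used here are assumptions, not the real file's headers.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        Country(
            row["country"],
            float(row["gdp_per_capita"]),
            GovType(row["gov_type"]),                 # look up enum by value
            int(row["population"].replace(",", "")),  # drop thousands separators
        )
        for row in reader
    ]


# Made-up rows; populations are quoted because they contain commas.
SAMPLE_CSV = (
    "country,gdp_per_capita,gov_type,population\n"
    'Country A,1500.5,Communist State,"1,200,000"\n'
    'Country B,300,n/a,"45,000"\n'
)

countries = read_countries(SAMPLE_CSV)
```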

Step 3: Designing functions